Qualcomm AI Engine Direct - Refactor llama runner #10578
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10578
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit a876626 with merge base c5dd476.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai, this PR refactors the llama runner. It is an infra change for adopting lookahead decoding and for supporting multi-turn conversation. Thanks
Summary:
- Refactored io_manager into five distinct components (see the sketch below):
  - DecoderRunner: Module wrapper class.
  - PromptProcessor: Handles prompt processing using the decoder and key-value manager.
  - TokenGenerator: Generates tokens using the decoder and key-value manager.
  - KVManager: Manages the key-value cache with kv_updater, including data buffer allocation, cache updates, and buffer updates in TensorImpl.
  - IBufferAlloc: Allocates data buffers from RPC memory or a client buffer.
- Support the multi-turn use case.

Validated on story llama. To simulate the scenario, decode mode was forced to generate 5 tokens each time, and tokens of random length are inserted after each prefill->decode round finishes.
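A minimal C++ sketch of how these five components could fit together. Only the class names come from the summary above; every method name, signature, and body here is an assumption for illustration, not the actual ExecuTorch API:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// IBufferAlloc: allocates data buffers from RPC memory or a client buffer.
class IBufferAlloc {
 public:
  virtual ~IBufferAlloc() = default;
  virtual void* allocate(std::size_t nbytes) = 0;
};

// KVManager: owns the key-value cache; allocates its buffers through
// IBufferAlloc and updates the cache (and the buffers referenced by
// TensorImpl) as new tokens are processed.
class KVManager {
 public:
  explicit KVManager(std::shared_ptr<IBufferAlloc> alloc)
      : alloc_(std::move(alloc)) {}
  void update_cache(std::size_t /*num_new_tokens*/) {
    // Would append/shift KV entries and rewire TensorImpl data pointers here.
  }

 private:
  std::shared_ptr<IBufferAlloc> alloc_;
};

// DecoderRunner: thin wrapper around the decoder Module; runs one forward pass.
class DecoderRunner {
 public:
  std::vector<float> step(const std::vector<std::int64_t>& /*tokens*/) {
    return {};  // Would invoke the underlying module and return logits.
  }
};

// PromptProcessor: feeds prompt tokens through the decoder, filling the KV cache.
class PromptProcessor {
 public:
  PromptProcessor(DecoderRunner* decoder, KVManager* kv)
      : decoder_(decoder), kv_(kv) {}
  void process(const std::vector<std::int64_t>& prompt_tokens) {
    decoder_->step(prompt_tokens);
    kv_->update_cache(prompt_tokens.size());
  }

 private:
  DecoderRunner* decoder_;
  KVManager* kv_;
};

// TokenGenerator: autoregressively generates tokens using the decoder
// and the KV cache maintained by KVManager.
class TokenGenerator {
 public:
  TokenGenerator(DecoderRunner* decoder, KVManager* kv)
      : decoder_(decoder), kv_(kv) {}
  std::vector<std::int64_t> generate(std::int64_t first_token,
                                     std::size_t max_new_tokens) {
    std::vector<std::int64_t> out{first_token};
    for (std::size_t i = 0; i < max_new_tokens; ++i) {
      std::vector<float> logits = decoder_->step({out.back()});
      kv_->update_cache(1);
      (void)logits;
      out.push_back(0);  // Placeholder for argmax/sampling over logits.
    }
    return out;
  }

 private:
  DecoderRunner* decoder_;
  KVManager* kv_;
};
```

Splitting the runner this way keeps prefill (PromptProcessor) and decode (TokenGenerator) as separate clients of one shared KV cache, which is what makes multi-turn conversation and, later, lookahead decoding possible without touching the decoder wrapper itself.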
Force-pushed from 4472c56 to a876626 (Compare).
Hey, sorry for being late on this PR. Can you help rebase?
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thank you!
If I apply it, I get the result "Once upon a time, there was". Is that expected?
Summary: Forward fix for pytorch#10578
Reviewed By: kimishpatel
Differential Revision: D75536694
To run multiple prompts with ./qnn_llama3_2_runner, pass the --prompt flag once for each prompt. For example:
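A hypothetical invocation (the prompt strings are placeholders, and any other required flags are omitted for brevity):

```
./qnn_llama3_2_runner --prompt "first prompt" --prompt "second prompt"
```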
Otherwise, it will only execute the first prompt.
Ah yes, I figured it out. Thanks
Summary: